Mining Heterogeneous Transformations for Record Linkage

Authors

  • Matthew Michelson
  • Craig A. Knoblock
Abstract

Heterogeneous transformations are translations between strings that are not characterized by a single function. E.g., nicknames, abbreviations and synonyms are heterogeneous transformations while edit distances are not. Such transformations are useful for information retrieval, information extraction and text understanding. They are especially useful in record linkage, where we determine whether two records refer to the same entity by examining the similarities between their fields. However, heterogeneous transformations are usually created manually and without assurance they will be useful. This paper presents a data mining approach to discover heterogeneous transformations between two data sets, without labeled training data. In addition to simple transformations, our algorithm finds combinatorial transformations, such as synonyms and abbreviations together. Our experiments demonstrate that we discover many types of specialized transformations, and we show that by exploiting these transformations we can improve record linkage. Our approach makes discovering and exploiting heterogeneous transformations more scalable and robust by lessening the domain and human dependencies.
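The abstract does not spell out the mining algorithm, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes that candidate matches between two unlabeled data sets can be found by token overlap (Jaccard similarity), and that a token pair which repeatedly differs across such candidate matches is a plausible heterogeneous transformation. The function names, the thresholds `min_similarity` and `min_support`, and the sample records are all hypothetical.

```python
from collections import Counter
from itertools import product


def tokens(record):
    """Lowercase a record and split it into a set of whitespace tokens."""
    return set(record.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mine_transformations(source_a, source_b, min_similarity=0.4, min_support=2):
    """Count token pairs that differ across likely-matching record pairs.

    Illustrative stand-in only: thresholds and scoring are assumptions,
    not taken from the paper.
    """
    counts = Counter()
    for rec_a, rec_b in product(source_a, source_b):
        tok_a, tok_b = tokens(rec_a), tokens(rec_b)
        if jaccard(tok_a, tok_b) < min_similarity:
            continue  # not similar enough to be a candidate match
        # Tokens unique to each side are candidate transformation endpoints.
        for pair in product(tok_a - tok_b, tok_b - tok_a):
            counts[pair] += 1
    # Keep only pairs seen in at least min_support candidate matches.
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical sample data: no labels, only two sources to compare.
source_a = [
    "robert smith 12 main street",
    "robert jones 4 oak avenue",
    "alice brown 9 elm street",
]
source_b = [
    "bob smith 12 main st",
    "bob jones 4 oak ave",
    "alice brown 9 elm st",
]

result = mine_transformations(source_a, source_b)
print(result)
```

On this toy data the nickname pair (robert, bob) and the abbreviation pair (street, st) each reach the support threshold, while accidental pairings such as (robert, st) appear only once and are filtered out. The all-pairs comparison is quadratic in the number of records, so a real data set would need some form of blocking before a loop like this becomes practical.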

Similar resources

Mining the Heterogeneous Transformations between Data Sources to Aid Record Linkage

Heterogeneous transformations are translations between strings that are not characterized by a single function. E.g., nicknames, abbreviations and synonyms are heterogeneous transformations while edit distances are not. Such transformations are useful for information retrieval, information extraction and text understanding. They are especially useful in record linkage, where the problem is to d...

Privacy Preserving Record Linkage via grams Projections

Record linkage has been extensively used in various data mining applications involving sharing data. While the amount of available data is growing, the concern of disclosing sensitive information poses the problem of utility vs privacy. In this paper, we study the problem of private record linkage via secure data transformations. In contrast to the existing techniques in this area, we propose a...

MAL4:6 - Using Data Mining for Record Linkage

This paper presents a first attempt at using pedigree-based data to improve record linkage. It describes a composite metric for similarity and a mechanism to extract relevant generational features. Results on a large data set demonstrate promise.

Record Linkage: Current Practice and Future Directions

Record linkage is the task of quickly and accurately identifying records corresponding to the same entity from one or more data sources. Record linkage is also known as data cleaning, entity reconciliation or identification and the merge/purge problem. This paper presents the “standard” probabilistic record linkage model and the associated algorithm. Recent work in information retrieval, federa...

High-Performance Computing Techniques for Record Linkage

The task of linking together information from one or more data sources representing the same entity (patient, customer, provider, business, etc.). If no unique identifier is available, probabilistic linkage techniques have to be applied. Applications of record linkage: remove duplicates in a data set (internal linkage); merge new records into a larger master data set; create patient oriented statist...

Publication date: 2007